Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. Data-snooping bias is a form of statistical bias that arises from this misuse of statistics. Any relationships found might appear to be valid within the test set but they would have no statistical significance in the wider population.
Data dredging and data-snooping bias can occur when researchers either do not form a hypothesis in advance or narrow the data used to reduce the probability of the sample refuting a specific hypothesis. Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, both of which make heavy use of data mining techniques.
The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that a particular result could have arisen by chance, and necessarily accept some risk of mistaken conclusions, known as the significance level. When large numbers of tests are performed, some are bound to give false results of this kind: by chance alone, 5% of randomly chosen hypotheses will turn out to be significant at the 5% level, 1% at the 1% significance level, and so on.
If enough hypotheses are tested, it is virtually certain that some will falsely appear to be statistically significant, since every data set with any degree of randomness contains some bogus correlations. Researchers using data mining techniques can be easily misled by these apparently significant results, even though they are mere artifacts of random variation.
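To make the scale of the problem concrete, here is a minimal simulation sketch; the sample sizes, the choice of a two-sample t-test, and the random seed are arbitrary assumptions. It runs many tests on pure noise and counts how many clear the 5% threshold by chance alone.

```python
# Minimal sketch: run many significance tests on pure noise and count how
# many come out "significant" at the 5% level purely by chance.
# (Illustrative only; the sample sizes and number of tests are arbitrary.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_tests = 1000      # number of independent hypotheses examined
n_samples = 50      # observations per group

false_positives = 0
for _ in range(n_tests):
    # Two groups drawn from the *same* distribution: the null hypothesis is true.
    a = rng.normal(size=n_samples)
    b = rng.normal(size=n_samples)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests significant at the 5% level")
# Expected output: roughly 50 (about 5%), even though no real effect exists.
```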
Circumventing the traditional scientific approach by conducting an experiment without a hypothesis can lead to premature conclusions. Data mining can be misused to extract more information from a data set than it actually contains. Failure to adjust existing statistical models when applying them to new data sets can also produce apparent patterns between attributes that would not otherwise have appeared. Overfitting, oversearching, overestimation, and attribute selection errors are all practices that can lead to data dredging.
The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then to collect relevant data, and finally to carry out a statistical significance test to see whether the results could be due to the effects of chance. (The last step is called testing against the null hypothesis.)
A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set will contain some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same population, it is impossible to determine if the patterns found are chance patterns. See testing hypotheses suggested by the data.
Here is a simple example. Flipping a coin five times and getting 2 heads and 3 tails might lead one to ask why the coin favors tails by fifty percent. Forming the hypothesis before flipping the coin, on the other hand, would lead one to conclude that only a 5-0 or 0-5 result would be very surprising, since the odds are 93.75% against such a result happening by chance.
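The quoted odds follow from the binomial calculation: the probability of an all-heads or all-tails run in five fair flips is 2 × (1/2)^5 = 1/16 = 6.25%, so the odds against it are 93.75%.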
As a more visual example, on a cloudy day, try the experiment of looking for figures in the clouds; if one looks long enough one may see castles, cattle, and all sorts of fanciful images; but the images are not really in the clouds, as can easily be confirmed by looking at other clouds.
It is important to realize that the alleged statistical significance here is completely spurious – significance tests do not protect against data dredging. When testing a data set on which the hypothesis is known to be true, the data set is by definition not a representative data set, and any resulting significance levels are meaningless.
In any list of 367 people, at least two must share the same day and month of birth, since there are only 366 possible birthdays. Suppose Mary and John both celebrate birthdays on August 7.
Data snooping would, by design, search for additional similarities between Mary and John, however obscure.
By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we may eventually find apparent "proof" of virtually any hypothesis.
Perhaps John and Mary are the only two people in the list who switched minors three times in college, a fact discovered by exhaustively comparing their life histories. Our hypothesis, biased by data snooping, can then become, "People born on August 7 have a much higher chance of switching minors more than twice in college."
The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college.
However, when we turn to the larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than twice. The "fact" exists only for a very small, specific sample, not for the population as a whole.
In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that, even subconsciously, the future data could not have influenced the formulation of the hypothesis. Of course, such a discipline requires waiting for new data to come in, to demonstrate the formulated theory's predictive power versus the null hypothesis. This process also makes it impossible to accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.
Suppose that observers note that a particular town appears to be a cancer cluster, but lack a firm hypothesis as to why. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for hundreds or thousands of different variables, mostly uncorrelated with one another. Even if all of these variables are independent of the cancer incidence rate, it is highly likely that at least one will be significantly correlated with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm it. Note that a p-value of 0.01 means that, if the null hypothesis is true, a result at least that extreme would be obtained by chance 1% of the time; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is more likely than not to obtain at least one p-value below 0.01 even when every null hypothesis is true.
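For example, with 100 mutually independent tests and every null hypothesis true, the probability of obtaining at least one p-value below 0.01 is 1 − 0.99^100 ≈ 63%, and with 300 such tests it exceeds 95%.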
The practice of looking for patterns in data is legitimate; applying a statistical test of significance (hypothesis testing) to the same data from which the pattern was learned is not. One way to construct hypotheses while avoiding the problems of data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset, say subset A, is examined for creating hypotheses. Once a hypothesis has been formulated, it must be tested on subset B, which was not used to construct it. Only if B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid.
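As an illustration only, the sketch below applies this out-of-sample procedure to the earlier birthday/minor example. The data set is randomly generated, and the column names, the 50/50 split, and the choice of a t-test are assumptions made for the example rather than part of any prescribed method.

```python
# Minimal sketch of the randomized out-of-sample procedure described above.
# The data set, column names, and choice of test below are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data set: 1000 people with a birthday flag and an outcome.
data = pd.DataFrame({
    "born_aug_7": rng.integers(0, 2, size=1000),
    "minor_switches": rng.poisson(1.0, size=1000),
})

# Randomly partition into exploration set A and confirmation set B.
mask = rng.random(len(data)) < 0.5
subset_a, subset_b = data[mask], data[~mask]

# Explore subset A freely to suggest a hypothesis (dredging is allowed here).
# Suppose examining A suggests: "people born on August 7 switch minors more often."

# Test that single, pre-specified hypothesis on subset B only.
group1 = subset_b.loc[subset_b["born_aug_7"] == 1, "minor_switches"]
group2 = subset_b.loc[subset_b["born_aug_7"] == 0, "minor_switches"]
_, p_value = stats.ttest_ind(group1, group2)
print(f"p-value on held-out subset B: {p_value:.3f}")
# Because B played no role in forming the hypothesis, a small p-value here
# would be meaningful in a way that the same test run on A would not be.
```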
Another remedy for data dredging is to record the number of significance tests conducted during the study and multiply each obtained p-value by this number before judging significance, or equivalently to divide the significance threshold by it (the Bonferroni correction); however, this is a very conservative correction. Controlling the false discovery rate is a more sophisticated approach that has become a popular method for handling multiple hypothesis tests.
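The following sketch illustrates both corrections on a short, made-up list of p-values; the particular values and the 0.05 threshold are purely illustrative.

```python
# Minimal sketch of the two corrections mentioned above, applied to a
# hypothetical list of p-values from many tests (the values are made up).
import numpy as np

p_values = np.array([0.001, 0.008, 0.012, 0.030, 0.040, 0.200, 0.450, 0.700])
m = len(p_values)
alpha = 0.05

# Bonferroni: require p < alpha / m (equivalently, compare m * p against alpha).
bonferroni_reject = p_values < alpha / m

# Benjamini-Hochberg (false discovery rate): find the largest k such that the
# k-th smallest p-value is <= (k / m) * alpha, and reject the k smallest.
order = np.argsort(p_values)
sorted_p = p_values[order]
thresholds = (np.arange(1, m + 1) / m) * alpha
below = np.nonzero(sorted_p <= thresholds)[0]
bh_reject = np.zeros(m, dtype=bool)
if below.size > 0:
    k = below.max()                    # index of the largest qualifying p-value
    bh_reject[order[: k + 1]] = True   # reject it and all smaller p-values

print("Bonferroni rejections:", bonferroni_reject.sum())
print("Benjamini-Hochberg rejections:", bh_reject.sum())
# Bonferroni is more conservative: it controls the chance of *any* false
# positive, while the FDR procedure controls the expected *fraction* of
# false positives among the rejected hypotheses.
```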
Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of the data and of the method used to examine them. Thus, if someone says that a certain event has a probability of 20% ± 2%, 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result will be between 18% and 22% with probability 0.95. No claim of statistical significance can be made by looking at the data alone, without due regard to the method used to assess them.
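A small simulation can make the "19 times out of 20" reading concrete; the true probability, sample size, number of repetitions, and the normal-approximation interval used below are all assumptions chosen for illustration.

```python
# Minimal sketch: repeatedly estimate a proportion whose true value is 0.20
# and check how often the 95% confidence interval produced by the same
# method covers the truth. (Sample size and interval formula are illustrative.)
import numpy as np

rng = np.random.default_rng(1)
true_p, n, repeats = 0.20, 1500, 10_000

covered = 0
for _ in range(repeats):
    sample = rng.random(n) < true_p          # n Bernoulli(0.20) trials
    p_hat = sample.mean()
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # roughly ±2% here
    if p_hat - half_width <= true_p <= p_hat + half_width:
        covered += 1

print(f"Interval covered the true value in {covered / repeats:.1%} of runs")
# Expected output: close to 95%, i.e. about 19 times out of 20.
```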